Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery

Abstract DNA variation analysis has become indispensable in many aspects of modern biomedicine, most prominently in the comparison of normal and tumor samples. Thousands of samples are collected in local sequencing efforts and public databases requiring highly scalable, portable, and automated workflows for streamlined processing. Here, we present nf-core/sarek 3, a well-established, comprehensive variant calling and annotation pipeline for germline and somatic samples. It is suitable for any genome with a known reference. We present a full rewrite of the original pipeline showing a significant reduction of storage requirements by using the CRAM format and runtime by increasing intra-sample parallelization. Both are leading to a 70% cost reduction in commercial clouds enabling users to do large-scale and cross-platform data analysis while keeping costs and CO2 emissions low. The code is available at https://nf-co.re/sarek.


Introduction
Genomic variation analysis of short-read data has become a key step for modern personalized medicine as well as for fundamental biomedical research.In particular, for biomedical assessment, it is used for characterizing genomes of samples taken from both healthy or tumor tissue.In clinical applications, the resulting information can be used to classify tumors and support treatment decisions (1)(2)(3) or research questions, such as drug development ( 4 ) or identify variations of interest in larger cohorts for further studies ( 5 ,6 ).The technologies and protocols for generating DNA sequencing data vary a lot.Each of the technologies comes with different specialties ranging from targeted gene panels and whole exomes (WES) to whole genomes (WGS) resulting in raw data files from a few to hundreds of gigabytes (GB).Various project-specific factors play a role in choosing the appropriate sequencing technolo-gies, such as the particular type of genomic variations of interest, the cost for sequencing, analysis, and data storage or turnaround times ( 7 ).Panel and exome sequencing is cheaper than WGS ( 8 ).Targeting defined regions allows for having high coverage in these regions.Hence, single nucleotide variants (SNVs) and small insertions and deletions (Indels) can be determined with high confidence.WGS, on the other hand, can be used to additionally investigate more complex alterations such as non-coding variants, large structural variants (SV) and copy-number variations (CNV).Another aspect is ethical considerations on how to handle 'accidentally detected' genomic variation in non-targeted genes which had been identified during the whole genome or whole exome sequencing ( 9 ,10 ).
Examples of large-scale genomics collection projects are TCGA / ICGC or the 100 000 Genomes Project.Some 6800 whole-genome samples from the former were uniformly processed for the 'Pan Cancer Analysis of Whole Genomes' study to obtain a consistent set of somatic mutation calls ( 11 ).More than 12 000 whole-genome samples from the latter were analyzed with respect to their mutational signatures to gain insights into tissue-specific markers ( 12 ).There are several national and international initiatives that aim at gathering more and more sequenced genomes, such as the Estonia Genome Project, the German Human Genome-Phenome Archive, the Iceland Genome Project ( 13 ), or the European '1+ Million Genomes' Initiative.Such studies often encompass many patients and their samples are often collected over longer periods of time at multiple sites.This requires stable, and reproducible pipelines that can be run on a variety of different high-performance clusters and cloud setups with differing scheduling system for distributed and homogeneous data processing ( 14 ).
Several pipelines ( 11 ,15-18 ) have been published in different workflow languages to automatically process reads from FastQ files to called (and annotated) variants accompanied by countless in-house workflows.With a certain variety of tools, the workflows usually encompass: quality control steps, read trimming, mapping, duplicate marking, base quality score recalibration, variant calling, and possibly annotation (see Figure 1 ).
While there are many workflows available, nfcore / sarek ( 15 ) stands out with its ability to process germline, tumor-only, and paired samples in one run.It can perform SNV / Indel, SV and CNV calling, as well as micro-satellite instability (MSI) analysis of WGS, WES, and panel data with currently 12 different tools.Since the pipeline is written in Nextflow ( 20 ), it benefits from the portability to any supported infrastructure, in particular several cloud vendors and common HPC schedulers enabling cross-platform homogeneous data processing.Furthermore, the pipeline allows the processing of non-model organisms.While the reference genomes and databases are most comprehensively provided for human and mouse genomes as well as subsets for many other organisms, they can be generated and saved for future runs for non-supported species.nf-core / sarek is part of the nf-core community project ( 21 ) and has a growing user base with now 242 stars on GitHub and 47.47K unique repository visitors since July 2019 (as of 1 June 2023) who additionally contributes with supporting and improving the code base either by direct contributions, suggesting features, or raising issues.
The pipeline has been used within the field of cancer research (22)(23)(24)(25)(26)(27) and beyond, such as the identification of rare variants in tinnitus patients ( 28 ), finding SNPs in driver genes related to stress-response in cowpeas ( 29 ), the genomic profiling of wild and commercial bumble bee populations ( 30 ), or the Personal Genome Project-UK ( 31 ).
Here, we present a re-implementation of the nf-core / sarek pipeline using the Nextflow DSL2 framework, an extension of the Nextflow syntax allowing to develop pipelines in a modular fashion, which increases user-based customization to maintain a modern pipeline.The re-implementation is focused on reducing required compute resources for efficient runs on different infrastructures.Minimizing required computing resources has always been of large interest.In particular, in the genomics space, more users run their calculations on several commercial and non-commercial cloud platforms ( 14 ).Commercial platforms usually come with a pay-per-use model, thus there is a high interest to reduce costs due to finite funding.
For non-commercial platforms or local clusters, direct costs are possibly of lower interest, however, reducing required resources allows for processing more samples in a shorter time frame.Furthermore, all used tools have been updated to their latest version upon release.For various steps, new tool options have been added, i.e. mapping with DragMap or variant calling with DeepVariant ( 32 ), or fastP ( 33 ) for adapter trimming.
Using this re-implementation, we show for the first time that population-scale homogeneous recomputing of WGS on commercial clouds is possible.
Our findings demonstrate a 69% reduction in compute costs when utilizing the nf-core / sarek 3.1.1pipeline, in comparison to a previous version, nf-core / sarek 2.7.2.This translates to costs of just $20 for comprehensive germline short and structural variant calling, and annotation.

Materials and methods
Implementation nf-core / sarek is a Nextflow-based pipeline that has been part of the nf-core project since release 2.5.Thus, nf-core / sarek is based on the nf-core template, which provides a code and documentation skeleton to ensure current best practices.The pipeline was one of the first to be ported from Nextflow's domain specific language version 1 (DSL1) to DSL2.The DSL2 framework allows modularization and code sharing.78 of 80 modules used in nf-core / sarek have been made available in the nf-core community's shared repository, nfcore / modules, implementing Nextflow wrappers around ideally individual tools.The tools are typically accessible through (bio)conda ( 34 ) and have a corresponding docker and singularity container provided by the Biocontainers ( 35 ) community enabling portability and reproducibility for each such 'module'.This single-tool-per-process approach ensures that previously occurring dependency conflicts are mitigated.nfcore / tools, a helper tool for users and developers, allows easy creation, installation and re-use of these modules, which will be important for further extensions of nf-core / sarek.

Data sets and compute environments
In order to evaluate the computational requirements of the pipeline, five tumor-normal paired samples from the ICGC LICA-FR ( 36 ) cohort are used (see Table 1 ).The unaligned BAM files are downloaded and converted to paired-end FastQ files using nf-core / bamtofastq v1.0.0 (formerly qbicpipelines / bamtofastq) with Nextflow version 20.10.0 and singularity.
Unless otherwise indicated, evaluations are done with Nextflow version 22.10.2build 5832 and Singularity 3.8.7-1.e18 on a shared HPC cluster.A parallel BeeGFS filesystem ( 37 ) is used with one metadata and two storage nodes.Each storage node has two raid systems with 10 * 14 TB disks respectively.The data systems are connected with a 50 GB ethernet connection.The HPC is using Slurm as scheduler and consists of 24 nodes with 32 cores and 64 threads each (2 * AMD EPYC 7343) with 512GB RAM and 2TB NVMEe disks as well as four nodes with 64 cores and 128 threads each (2 * AMD EPYC 7513) with 20248GB and RAM and 4TB NVMe disks, these NVMes are utilized via Nextflow scratch option.To increase the speed and decrease the load on the filesystem, calculations are therefore performed directly on the NVMe, and only the results are written back to the BeeGFS.The clus-  S1 .nf-core / sarek, Ov arflo w, and TOSCA ha v e additional option f or base quality score recalibration.All pipelines allo w v ariant calling and annotation.T he varying supported variant calling types are highlighted for each pipeline respectively.For the One-touch pipelines (OTP), separate workflows have to be triggered for each variant calling type with build-in annotation.nf-core / sarek stands out by covering germline, tumor-only, and paired variant calling f ollo w ed b y annotation across whole genome, whole-e x ome, and panel sequencing data.T he code is a v ailable online and w ell-documented, implemented in Ne xtflo w to enable port abilit y to various infrastr uct ures and supported by an active community.

Reducing storage requirements
In this release, the internal file format following duplicate marking is changed to using CRAM files.To evaluate the required compute resources and the actual data footprint for both file formats, five paired ICGC genomes are run through all tools part of nf-core / sarek 3.1.1and an altered version of nf-core / sarek 3.1.1that uses the BAM format instead.For each process, singularity containers from the Biocontainers registry are used.Each run configuration is repeated three times.Unless otherwise specified, the default parameters for nf-core / sarek 3.1.1are used.We evaluate the pre-processing and variant calling steps independently.
We evaluate all processes corresponding to FastQ quality control, aligning the reads to the reference genome, duplicate marking, base quality score recalibration (BQSR), and quality control of aligned reads.The command for running the preprocessing steps is the following: Listing 3: Pre-processing nextflow run nf-core/sarek -r 3.1.1-profil e cfc --input ./input.csv The memory requirements for BWA-MEM are increased to 60 GB, as well as the runtime for GATK4 Markduplicates and SAMtools merge to 16h and 8h respectively from the provided defaults.
Secondly, we evaluate variant calling with all tools for all germline and paired somatic variant calling for all samples: Listing 4: Variant calling nextflow run nf-core / sarek -r 3.1.1-profile cfc -input recalibrated.csv-tools deepvariant,haplotypecaller,mutect2,strelka, freebayes,ascat,controlfreec, cnvkit,manta,tiddit,msisensorpro -step vari-ant_calling The requested time for all processes is increased to 144h to mitigate interruptions due to runtime time-outs by providing a custom configuration file.

Evaluation of pipeline runtime and resource usage
In order to evaluate the impact on runtime and storage requirements, ten samples are run with different sizes of scattered groups: 1, 10, 21, 40, 78 and 124 (default).In order to generate the respective interval group sizes the parameter -nucleotides_per_second is set to 5 000 000, 400 000, 200 000, 70 000, 10 001, 1000 (default).For read splitting, fastP is run with various numbers of CPUs specified (respectively, 0, 4, 8, 12 (default), and 16) since the tools generates chunks firstly by number of CPUs and secondly by the maximum number of entries per file as defined by a parameter.Here, it is set to 500 000 000 to prevent any further subdivision.

Benchmarking of short variants against truth VCFs
In order to benchmark the variants called by the pipeline, both the germline and paired variant calling tools are evaluated on three datasets each: for the germline callers the GiaB datasets HG002-HG004 ( 38 ) are used, for the paired somatic callers three WES datasets from the Sequencing Quality Control Phase II Consortium ( 39 ).Comparisons are made only in the high-confidence regions defined for the benchmark, i.e. omitting difficult-to-call regions.
Short variant calls from Haplotypecaller, Deepvariant, Freebayes and Strelka2 mapped with Dragmap (base quality recalibration is skipped), BWA-MEM, and BWA-MEM2 are included in the analysis.Deepvariant is evaluated only on HG003.
Furthermore, reads for the sample HG002 sequenced with MGISEQ and BGISEQ500 are downloaded from the manu-facturer.They are downsampled to 20 ×, 30 ×, 40 × and 50 × using seqtk and subsequently processed with nf-core / sarek 3.1.1using default parameters and all eligible variant callers.Evaluation against the truth VCF is done using hap.py.

Comparison of copy number calls against PCAWG samples
In order to evaluate the copy number calls, WGS alignment files for five PCAWG patients (DO44888, DO44930, DO44890, DO44919, DO44889) were downloaded and reprocessed with nf-core / sarek.In addition, the provided copy number calls were downloaded from the ICGC portal for each patient.The calls are compared by dividing the calls from each tool into two groups: amplifications and deletions.For each base, it is then determined how many other tools identified for a given the same group.Visualization of the respectively called copy numbers is done using karyoploteR ( 41 ) and CopyNumberPlot ( 42 ).

Portability to AWS cloud and computing cost
In order to evaluate the costs for running nf-core / sarek on AWS batch, the compute environments are created using Tower Forge ( https:// cloud.tower.nf/), with the following settings: spot instance, max CPUs 1000, EBS Auto scale, and fusion mounts enabled.Instance types are chosen by using strategy 'optimal' with the allocation strategy spot_capacity_optimized .The computation is run in AWS region us-east-1.The pipeline runs are launched with Nextflow Tower setting the Nextflow version to 22.10.3 by adding export NXF_VER = 22.10.3 to the pre-run script and process.afterScript= ' sleep60 ' to the config section.
The pipeline is run on the normal sample for DO50970 using default settings together with Strelka2, Manta and VEP.For one paired sample evaluation, the pipeline was run the tumor-normal pair (DO50970 / DO50970).

Pipeline overview and summary of new tools and features
An overview of nf-core / sarek v3.1.1 is shown in Figure 2 .The input data is an nf-core community standardized samplesheet in comma-separated value (CSV) format, that provides all relevant metadata needed for the analysis as well as the paths to the FastQ files.The pipeline has multiple entry points to facilitate (re-)computation of specific steps (e.g.recalibration, variant calling, annotation) by providing a samplesheet with paths to the intermediary (recalibrated) BAM / CRAM files.The pipeline processes input sequencing data in FastQ file format based on GATK best-practice recommendations ( 43 ,44 ).It consists of four major processing units: pre-processing, variant calling, variant annotation and quality control (QC) reporting.

Pre-processing
Enabling homogeneous processing of global genomic resources requires flexibility on the genomic 'raw' input data.To cope with the fact that different data repositories provide their 'primary' data in different formats, nf-core / sarek support both BAM and FastQ as input.When BAMs are provided as starting input, they are converted to FastQ via SAMtools ( 45 ) which allows for fully homogenous processing independent of the provided input format.
The FastQ files are then split into shards with fastP including optional adapter trimming allowing the subsequent alignment step to be run on smaller machines.FastP has been introduced with the release v3.0 and is advantageous over other splitting and adapter removal tools as it combines FastQ sharding and adapter removal into one step, speeding up the computation.With this new implementation, we no longer need to rely on Trim Galore! and Nextflow's native split-Fastq() function.
Version 3.1.1 of nf-core / sarek can handle UMI barcodes, which are used in some protocols to detect low allele frequency variants ( 46 ).The user can opt for using Fulcrumgenomics' fgbio ( https:// github.com/fulcrumgenomics/ fgbio ) tool, which generates a consensus read among the ones carrying the same UMI.It will then use these reads as input for the remaining pre-processing steps.
The split FastQ files are aligned with one of the available mappers, which include BW A-MEM ( 47 ), BW A-MEM2 ( 48 ) or DragMap,( https:// github.com/Illumina/ DRAGMAP )) and name-or coordinate-sorted with SAM-tools.By adding DragMap support, we comprehensively cover the community's needs.We added the missing precomputed reference indices for BWA-MEM2 and DragMap for GRCh38 and GRCh37 to speed up the computation.As recommended by GATK guidelines, we use the entire genome during mapping.Off-target reads for WES and panel analysis are removed according to the provided BED file during the base quality score recalibration (BQSR) step.
By default, the aligned BAMs are then merged, duplicates are marked with GATK4 Markduplicates, and converted to CRAM format in one process to reduce runtime and storage needs.The duplicate marking step was improved by providing name-sorted alignment files to GATK4 MarkDuplicatesSpark.If duplicate marking is skipped, SAMtools is used for merging and conversion to CRAM format.BQSR on the resulting CRAM files is run with GATK4 BaseRecalibrator and GATK4 ApplyBQSR.For both, the GATK Spark implementation is available.Both steps can be skipped, in which case the mapped BAMs are converted to CRAMs using SAMtools.
In order to speed up the computation, genomic regions are processed in parallel following the duplicate marking step.Small regions are grouped and processed together to reduce the number of jobs spun up.By default, interval lists for the complete reference genome provided by GATK are used, enabling scattering by chromosome, and removing unresolved and difficult regions.For targeted sequencing data, we have added support to use the respective target bed files for parallelization as recommended by the GATK guidelines ( https: // gatk.broadinstitute.org/hc/ en-us/ articles/ 360035889551-When-should-I-restrict-my-analysis-to-specific-intervals-, last accessed: 2023-07-17).Previously, BQSR was always run on the intervals provided for WGS, which led to recalibrating off-target reads increasing computational resources needed.We have added further support to allow users to control group size not just for custom interval files, but also for the ones generated from the genomic regions, allowing a more tailored setup.
Variant calling nf-core / sarek includes a comprehensive set of variant callers to obtain SNPs / Indels, SV, MSI, and / or CNV values using a total of 12 tools (Figure 2 ).The variant calling tools have to be selected by the user to ensure the resource footprint is kept low and only necessary tools are run.They are executed in parallel.Newly included tools in the v3.1.1 release are Deepvariant, CNVKit ( 49 ), and Tiddit ( 50 ).Furthermore, Haplotypecaller supports both single sample or joint-germline calling ( 51 ).When both Strelka2 ( 52 ) and Manta ( 53 ) are selected, the candidate Indels from Manta are used for SNP / Indel calling according to the Strelka2 best-practices.We added a new parameter that allows skipping germline-only variant calling for paired samples to further reduce time, costs, and compute resources for somatic variant calling.Furthermore, scatter-gathering is now supported for all applicable variant calling tools across intervals (see Supplementary Figure S1 ).The sharded VCF files are then merged with the GATK4 MergeVCF tool.In this way, we reduce computing demands, by avoiding repeated cycles of (de)compressing the files.

QC reporting
Throughout the pipeline, quality control tools are run, including FastQC before alignment, mosdepth ( 60 ), and SAMtools post-duplicate marking and BQSR, as well as vcftools ( 61 ) and bcftools on called variants.These results are collected into a MultiQC ( 62 ) report together with software version numbers of all the executed tools.The previously used Qualimap ( 63 ), which has no direct CRAM support and requires a high amount of computational resources, has been replaced with mosdepth, a fast quality control tool for alignment files.The tool produces comprehensive output files allowing to visualize, e.g.coverage data with the Integrated Genome Viewer (IGV) ( 64 ) for easy inspection.In addition, we have enabled quality reporting such as mapping statistics on provided CRAM files, when the pipeline is started from variant calling.
All tools have been updated to their latest stable release at the time of writing.For a complete overview of the most important tool changes, see Supplementary Table S2 .

Pipeline skeleton changes
We expanded the continuous integration testing to include all the new functionality, as well as adding md5sum checking of output files wherever possible.Additionally, we added full size tests that are automatically run on each pipeline release.All nfcore pipelines require a full-size test on realistic data upon release to ensure functionality beyond small test data and portability to cloud infrastructures.The datasets used here include the Genome in a Bottle (GiaB) data set HG001 (downsampled to 30 × WGS) for germline variant calling testing and the tumor-normal pair SRR7890919 / SRR7890918 provided by the SEQC2 effort for somatic variant calling testing.Since each of the selected datasets comes with validated VCFs to compare against, they are suited for further benchmarking to investigate the variant calling results.The results for each full size test are displayed on the website nf-co.re/ sarek and are available for anyone to explore or download.
High-quality code readability is achieved by combining modules used in the same analysis context into subworkflows, e.g.variant calling with a specific tool and subsequent indexing of the resulting VCF files.In addition, new analysis steps can be added by providing such encapsulated subworkflows, and obsolete parts can quickly be removed entirely.Furthermore, dividing different analysis steps into subworkflows written in separate files simplifies development, which is often done asynchronously with developers at different institutes.

CRAM format allows for storage space reduction
The pipeline has a large data footprint due to the number of computational steps, input data size and Nextflow's requirement for a work directory with intermediate results to facilitate resuming.In order to ease storage needs as a possible bottleneck, CRAM files are used as of nf-core / sarek 3.They are a more compressed alternative to BAM files storing only differences to the designated reference.A majority of tools post-duplicate marking support CRAM files.The pipeline can handle both BAM files and CRAM files as in-and output to accommodate various usage scenarios.
We evaluated the resource usage of the two alternatives by running five tumor-normal pairs on nf-core / sarek 3.1.1as well as on a branch( https:// github.com/FriederikeHanssen/ sarek/ tree/ bam _ 31 ) based on the release in which the internal format was replaced with BAM.Pre-processing (Figure 3 and ( A ) Average realtime, and maximum CPU and memory usage (peak_rss) as reported by the Nextflow trace file for the main processes.For processes split within a sample (i.e.ApplyBQSR), the task with the highest runtime per sample is shown as the process runtime.Resource usage was compared using the paired Wilcoxon test ( ** P < 0.01, * P < 0.05).Two out of the four shown processes are significantly faster when using CRAMs instead of BAMs at the expense of an increase in memory or CPU usage.( B ) Storage was evaluated by calculating the total size of the work directories of all tasks of the respective process.Each condition was repeated three times for samples of five tumor-normal paired patients.
Supplementary Figure S2 ) and variant calling ( Supplementary Figure S3 ) were evaluated separately.For eleven processes, the CRAM-based setup resulted in a significant decrease in runtime, for ten an increase in memory, and for eleven an increase in CPU hours could be measured.The overall average CPU hours for the pre-processing benchmark on CRAM version was 3252.37, in comparison to BAM 3,761.07.The reduction in CPU hours usage, however, had to be compensated by a 34% increase in memory usage.The overall average total memory usage on the CRAM version was 10,346.8GB,for the BAM version it summed up to 7739.51 GB.The storage usage for the work directory for pre-processing these samples drops by 65%, from 170.4 TB (BAM) to 59.7 TB (CRAM).Processes outputting CRAM files reduce their storage needs by at least a third.In the case of GATK4 ApplyBQSR it was reduced by 64%.Processes operating on CRAM files outputting a different format, e.g.VCFs, show no change in storage usage.

S catter-g ather implementations reduce runtime and resource usage
Scatter / gather implementations are highly relevant for parallel processing approaches across genomic regions for BQSR and variant calling.In this release, we have further extended these options: Before mapping, the input FastQ files can now be split and mapped in parallel.For BQSR and variant calling more options to customize the amount of scattering as well as further support for all eligible variant callers are implemented.
We evaluated the impact of different degrees of FastQ file sharding on the mapping process by investigating the division step (fastP), mapping (BWA-MEM) and subsequent merging (GATK4 Markduplicates).The realtime as reported by the Nextflow trace of the longest running mapping process of any one sample was summed up with the realtime of fastP and GATK4 Markduplicates.The space of the work directories of each involved task was summed up, as well as the CPU hours (see Figure 4 A).The overall runtime for the mapping processes decreases until it reaches a plateau at 12 shards, achieving a reduction of the median runtime to 37%.The storage usage increased as soon as any sharding was done due to the sub-FastQs being written to the disk.The CPU hours remain approximately the same due to the long alignment time for large files (see Supplementary Figure S4 ).
We evaluated the impact of different degrees of scattering across genomic intervals on the recalibration and variant calling process with respect to resource usage.Similarly to the mapping processes, the most extended runtime per interval group per samples of all involved processes for BQSR (GATK4 BaseRecalibrator, GATK4 GatherBQSRReports, GATK4 Ap-plyBQSR and SAMtools merge) and variant calling (calling and GATK4 MergeVCFs) was summed up respectively.Storage usage and CPU hours results for all tasks were added up.The runtime decreased the most for all measured tools (see Figure 4 B and Supplementary Figure S5 -S7 ) when the number of interval groups was set to 21. Raising the number of interval groups did not decrease runtime further.For GATK-based tools, storage usage increased with each further splitting of interval groups.For BQSR storage requirements between 21 intervals groups and 124, the default value, increased by a factor of five.For Deepvariant less storage space was required when applying scattering ( Supplementary Figure S6 ), however, for all other variant callers the storage needs remain on a stable level.The required CPU hours remain stable across various amounts of scattering for all tools.Effect of sharding the input files on the mapping processes, including fastP, BWA-MEM and Markduplicates.The input FastQ files were split into smaller pieces increasing the amounts of shards and the runtime, work directory size and CPU hours were evaluated for each split size.FastP was run with a different number of CPUs corresponding to the desired number of shards.( B ) Effect of parallelizing computations across interval groups on BQSR processes, which include the B aseR ecalibrator, GatherBQSRR eports, ApplyBQSR and SAMtools merge process.When all interv als w ere processed together as one group the memory requests for ApplyBQSR had to be increased.The violin plots show computations on tumor-normal paired samples of five patients.The time was evaluated by summing up the highest realtime per task per sample as reported by the Nextflow trace report.The work directory size and CPU hours are the sums of all involved tasks.

Benchmarking of short variants against truth VCFs
The pipeline's SNP and indel variant callers were evaluated for both the germline and paired somatic analysis tracks, the former on three WGS datasets from Genome in a Bottle (HG002-4) ( 38 ), the latter on three tumor-normal paired WES datasets from the SEQ2 Consortium ( 39 ).The results were compared to the respective 'gold standard' VCFs for high-confidence calls.
We evaluated the precision, recall, and F1 score over all samples.For the germline calls of Deepvariant, only sample HG003 was used since its model was trained on the remaining datasets.The tools' precision, recall, and F1 scores are in accordance with the previously reported FDA precision challenge ( 65 ) results for GiaB samples (see Figure 5 A, B for SNPs, Supplementary Figure S8 for Indels).BWA-MEM and BWA-MEM2 lead to higher recall values than DragMap.Strelka2 together with Manta and DeepVariant perform best in all three evaluated metrics.In addition, we investigated one sample sequenced with MGI and BGISEQ respectively with similar results for all variant callers ( Supplementary Figures S9 and S10 ).
Similarly, we evaluated the precision, recall, and F1-score for the somatic calls.Filtered Mutect2 calls have the highest precision calls for all samples, and FreeBayes ones the highest recall values (Figure 5 C, D for SNVs).The highest F1-Score is measured for Mutect2, followed by Strelka2.For Indels, Strelka2 outperforms all other tools ( Supplementary Figure S8 ).The results are in-line with what has been previously reported ( 66 ).

Comparison of copy-number calls against PCAWG samples
The pipelines' paired somatic copy number calls from ASCAT, CNVKit and ControlFREEC were compared against five samples from the PCAWG ( 11 ) cohort.Each sample has two call sets, one generated with the OTP pipeline and one with the Sanger pipeline (SVCP).In Figure 6 , for each tool the calls for each base are evaluated with respect to how many other tools confirmed the base.Calls are divided into two categories: amplifications or deletions.For the former, for patient DO44890 the bases called by each tool are confirmed by at least 3 other tools.Similarly, for DO44919 with an exception for the OTP results, where a set of bases could not be confirmed by any other tool.For a majority of the CNVKit and ControlFREEC calls were confirmed by three or more tools for each sample.Overall, there are fewer deletions found than amplifications.For the samples DO44890 and DO44919 a majority of the deletions were called by two or more tools.For the remaining samples, calls by CNVKit, ControlFREEC, and SVCP were for a majority of the cases confirmed by one other tool.All calls are visualized in Supplementary Figure S11 .

Portability to AWS and computing costs
The sheer amount of existing genomics data and the unavoidable need for even more data for the detection of diseasecausing genotype-phenotype correlations or population-scale analyses will test the limits of on-premise computing sooner or later.Consequently, more and more data analysis is shifted or supplemented by computation in the cloud.The most recent Nextflow Community survey 2023 indicates that 43% of users use cloud services, which represents an increase of 20% compared to the previous year ( https:// seqera.io/blog/ thestate-of-the-workflow-2023-community-survey-results/, last accessed: 2023-07-17) with AWS still being the most popular among the respondents.Therefore, we evaluated the cost development of nf-core / sarek between 2.7.1 and 3.1.1on AWS Batch.
We were able to reduce cloud computing costs to approximately 30% (see Table 2 ).Furthermore, we could reduce the overall runtime and CPU hours.For a single sample the needed CPU hours are reduced by approximately 70%, and the runtime by 84%.Here, we used spot instances-unused instances auctioned off at a percentage of their on-demand price-whose prices fluctuate constantly.The business models of the cloud providers result in varying spot price percentages and spot prices are frequently subject to change depending on the overall demand.

Discussion
An essential aspect of high-throughput processing is seamless scalability, one of the main advantages of using a dedicated workflow language such as Nextflow.Combining adapter trimming, quality control, and sharding of FastQ files in a single step, as well as more tailored splitting into intervals for variant calling, reduces the needed CPU hours by 66%.Replacing Nextflow's native splitFastq() function with a dedicated process, allows us to make use of all advantages of regular job submission, including assigning resources to the jobs, automatic retries on job failure, and resume functionality.Previous pipeline versions have typically not been able to split the FastQ files and have, thus, missed out on scalability options.Our experiments have shown clear limits for parallelization into interval groups.There is no further benefit to reducing runtime beyond 21 interval groups.However, storage usage increases.This is respected in future releases by setting the default number of interval groups to 21 further reducing the storage footprint of the pipeline.
The switch to using CRAM files results in reduced storage space usage of 65%, at the expense of higher memory require-  ments.The additional memory needs are distributed over the thousands of tasks run for the benchmark.Due to this, in practice, memory is not a limiting factor.The additional needs are only required for the task's run time, usually comprised of a couple of GB at most, and can subsequently be re-used.The storage, however, accumulates over the entire pipeline run and can therefore pose a bottleneck for usage scenarios with large input data sets relative to the available storage on the respective compute system.Both changes, switching to CRAM format reducing storage requirements by two-thirds and reducing the amount of scattering -further reducing storage requirements by a factor of 5, will enable users to run the pipeline on smaller systems more efficiently.While there are many benefits for the scientists running the pipeline, we would like to emphasize also the ecological need to reduce the carbon footprint in computational research.This is relevant in particular considering the evergrowing number of available samples in national and international genome repositories, which aim at facilitating truly comprehensive population-based analysis and understanding underlying genome variations.Our results provide solutions to reduce CO 2 emissions.
One of the main objectives of this work is to enable scientists to run the pipeline in cloud environments at low costs per sample.Cost-efficient cloud computing is increasingly important for data-driven science.Intransparent and unpredictable cost models discourage a scientist.Using the cloud setup as described in this work, we reduced the costs to 50% in comparison to previous releases.Recently, Seqera Labs posted a blog article ( https:// seqera.io/blog/ breakthrough-performanceand-cost-efficiency-with-the-new-fusion-file-system/, last accessed: 2023-07-17) showing further cost reductions when using nonvolatile memory express (NVME) storage, a protocol to accelerate transfer speed, and a new fusion file system promising even cheaper runs in the future.In addition, users can further reduce their cloud costs by selecting compute instances in cheaper AWS regions and fixing the spot price percentage they are willing to pay more tightly by enforcing an upper bound for costs per sample at the expense of possible waiting time for such machines to become available.nf-core / sarek is an established, comprehensive variant calling pipeline in the genomics field, which can be applied to any organism which a reference genome exists.Future releases further simplify analyses with custom references by enabling pre-computation of all needed indices, an interesting feature when multiple users work with organisms on shared systems for which reference files are not provided by default.On request, such references and their corresponding indices and database files can be added to the central resource AWS-iGenomes and made available to the community.
We benchmarked the pipelines performance for germline and somatic small variants against given truth datasets, as well as comparing copy number calls to ones obtained by the PCAWG study.The copy number evaluation highlights the need for validating calls with multiple tools.There is a set of bases showing strong evidence by being called by all tools.However, some tools, CNVKit and ControlFREEC, show a seemingly more conservative approach with calls validated by almost all others, whereas ASCAT, SVCP and OTP generated overall more calls which were confirmed by fewer tools.The tools' performance differs between samples indicating possible further factors need to be taken into account.The similarities between ASCAT and SVCP can be explained by the fact that the SVCP pipeline also uses an earlier version of the ASCAT tool.
The nf-core / sarek rewrite to DSL2 makes the code base more maintainable and easier to read, a factor that is crucial to allow new developers to join the effort with a reasonable learning curve.All pipeline processes are specified in separate files in the form of modules a majority of which are maintained by the nf-core community.Tools used in the same context are combined into subworkflows.They will be added to the nf-core subworkflows collection in the near future allowing further collaboration and shared maintenance across pipelines and beyond the nf-core community.Modularising all tools will enable us to simply do a drop-in replacement when tools should be exchanged for a different one or new ones added as they emerge.nf-core / tools installs them in the appropriate directory, they just need to be called at the appropriate position in a (sub)workflow.Furthermore, the use of modules allows users to customize the released pipeline version at runtime.Before this change, it was necessary to change the underlying code if arguments of a tool were not exposed to the pipeline.This was limiting for users since they had to wait for the feature to be implemented and released before using it.With the new modular config files, arguments can be modified by providing a user-based custom configuration setting the exact command line arguments without changing the underlying code or the release tag.This allows a higher degree of flexibility for the analysis whilst simultaneously using a released version.Reproducibility can then be ensured as before by providing the exact release version, pipeline parameters, and the respective custom config(s).

Figure 1 .
Figure 1.There are many published and countless unpublished variant calling pipelines written in dedicated workflow languages like Nextflow (NF), Snak eMak e (SM) ( 19 ), and Roddy.All pipelines align the reads, duplicate mark them, and employ various QC metrics (see Supplementary TableS1.nf-core / sarek, Ov arflo w, and TOSCA ha v e additional option f or base quality score recalibration.All pipelines allo w v ariant calling and annotation.T he varying supported variant calling types are highlighted for each pipeline respectively.For the One-touch pipelines (OTP), separate workflows have to be triggered for each variant calling type with build-in annotation.nf-core / sarek stands out by covering germline, tumor-only, and paired variant calling f ollo w ed b y annotation across whole genome, whole-e x ome, and panel sequencing data.T he code is a v ailable online and w ell-documented, implemented in Ne xtflo w to enable port abilit y to various infrastr uct ures and supported by an active community.

Figure 2 .
Figure 2. Ov ervie w o v er nf-core / sarek.T he pipeline consists of three sections: pre-processing based on G ATK B est-practice recommendations (mapping, duplicate marking, and base quality score recalibration), variant calling supporting tools for SNP / Indel, SV, CNV, MSI calling and annotation.Throughout the pipeline, various quality control tools are run and collated into a comprehensive MultiQC report.The variant calling tools can be mixed in any combination and are all run in parallel.

Figure 3 .
Figure 3. Resource usage of nf-core / sarek 3.1.1when storing intermediate data in BAM versus CRAM format.(A ) Average realtime, and maximum CPU and memory usage (peak_rss) as reported by the Nextflow trace file for the main processes.For processes split within a sample (i.e.ApplyBQSR), the task with the highest runtime per sample is shown as the process runtime.Resource usage was compared using the paired Wilcoxon test ( ** P < 0.01, * P < 0.05).Two out of the four shown processes are significantly faster when using CRAMs instead of BAMs at the expense of an increase in memory or CPU usage.( B ) Storage was evaluated by calculating the total size of the work directories of all tasks of the respective process.Each condition was repeated three times for samples of five tumor-normal paired patients.

Figure 4 .
Figure 4. Sharding the input FastQ files and parallelizing computation on interval groups reduces the overall runtime of the nf-core / sarek pipeline.( A )Effect of sharding the input files on the mapping processes, including fastP, BWA-MEM and Markduplicates.The input FastQ files were split into smaller pieces increasing the amounts of shards and the runtime, work directory size and CPU hours were evaluated for each split size.FastP was run with a different number of CPUs corresponding to the desired number of shards.( B ) Effect of parallelizing computations across interval groups on BQSR processes, which include the B aseR ecalibrator, GatherBQSRR eports, ApplyBQSR and SAMtools merge process.When all interv als w ere processed together as one group the memory requests for ApplyBQSR had to be increased.The violin plots show computations on tumor-normal paired samples of five patients.The time was evaluated by summing up the highest realtime per task per sample as reported by the Nextflow trace report.The work directory size and CPU hours are the sums of all involved tasks.

Figure 5 .
Figure 5. Germline and somatic variant calling evaluation of high-confidence calls using ground-truth benchmarking data with respect to SNPs. ( A , B )The germline variant track of the pipeline was evaluated using 3 WGS GiaB datasets (HG002-HG004).The average precision, recall, and F1-score values across all the samples are plotted, respectively.( C , D ) The somatic paired variant calling track was evaluated using three tumor-normal WES pairs (EA, FD, NV) from SEQ2C.

Figure 6 .
Figure 6.Copy number calling comparison of calls obtained with tools in sarek (ASCAT, CNVKit, ControlFREEC) and the two available call sets the ICGC portal for five patients from the PCAWG ( 11 ) study.The calls are divided into deletions or amplifications.For each event from each caller, the number of tools supporting it are plotted.

Table 1 .
Datasets used for benchmarking are part of the LICA-FR cohort.The BAM files are downloaded and converted to FastQ files.The respective donor IDs and file sizes of the converted FastQ files are listed

Table 2 .
Average costs per patient on AWS batch for nf-core / sarek version 2.7.2 and 3.1 .1 .All pipeline runs w ere perf ormed with the tools, Strelka2, Manta and VEP.Each analysis run is repeated three times and run on data from the donor DO50970 with either their normal or tumornormal paired sample.The normal sample has a median co v erage of 36X, the tumor sample of 65 ×